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ABSTRACT 

Motivation: Several applications in bioinformatics, such as genome 
assemblers and error corrections methods, rely on counting and 
keeping track of fc-mers (substrings of length k). Histograms of 
fc-mer frequencies can give valuable insight into the underlying 
distribution and indicate the error rate and genome size sampled in 
the sequencing experiment. 

Results: We present KmerStream, a streaming algorithm for 
computing statistics for high throughput sequencing data based on 
the frequency of fc-mers. The algorithm runs in time linear in the 
size of the input and the space requirement are logarithmic in the 
size of the input. This very low space requirement allows us to deal 
with much larger datasets than previously presented algorithms. We 
derive a simple model that allows us to estimate the error rate of 
the sequencing experiment, as well as the genome size, using only 
the aggregate statistics reported by KmerStream and validate the 
accuracy on sequences from a PhiX control. 

As an application we show how KmerStream can be used to 
compute the error rate of a DNA sequencing experiment. We run 
KmerStream on a set of 2656 whole genome sequenced individuals 
and compare the error rate to quality values reported by the 
sequencing equipment. We discover that while the quality values 
alone are largely reliable as a predictor of error rate, there is 
considerable variability in the error rates between sequencing runs, 
even when accounting for reported quality values. 
Availability: The tool KmerStream is written in C++ and 
is released under a GPL license. It is freely available at 
https://github.com/pmelsted/KmerStream 
Contact: pmelsted@hi.is 



1 INTRODUCTION 

fc-mers are one of the most fundamental objects used when 
analyzing DNA sequencing data. Many assembly algorithms (Zerbino 
and Bimey, 2008; Gnerre et al, 2011; Li et al, 2010) start by 
constructing a de Bruijn graph, a graph containing all fc-mers for 
some fixed fc. Some of the most commonly used algorithms for 
aligning reads to a reference genome start by finding short exact 
matches of a fixed length fc (commonly referred to as a seed); an 
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index of all fc-mers is constructed and from this index an initial 
alignment of a part of the read is found. 

Current methods for obtaining aggregate statistics of fc-mer 
data are based on keeping track of all fc-mers in a set of reads. 
Much work has been done on reducing memory requirements, 
based on exact or approximately correct methods of keeping track 
of a large set of fc-mers. This work includes using succinct 
set representations (Conway and Bromage, 2011) or probabilistic 
encodings such as Bloom filters (Pell et al, 2012; Chikhi and 
Rizk, 2012; Melsted and Pritchard, 201 1). Although the impact on 
memory usage is considerable, compared to previous approaches, 
these methods require storing all fc-mers, explicitly or implicitly, in 
memory. Thus the amount of resources will grow linearly with the 
input size. Many methods also rely on having access to all the reads 
for multiple passes over the data. Thus all of the above methods 
suffer from "the curse of deep sequencing" (Roberts et al., 2011) 
in which more sequencing can overwhelm the program in terms 
of memory usage and the algorithms simply fail to make use of 
increased amounts of data. 

We propose using streaming algorithms to solve this problem, a 
framework first proposed by Alon, Matias and Szegedy (Alon et al, 
1996). In this framework we assume that the algorithm has limited 
memory, compared to the size of the dataset, and data arrives one 
item at a time. The algorithm has limited time to process the item, 
usually constant time or logarithmic in the input, and cannot store all 
the data. Additionally, in the one pass setting, the algorithm can only 
observe each item once and can never look back into the stream. 
Several streaming algorithms have been discovered for a number 
of problems, such as counting the number of distinct items in the 
data stream using logarithmic memory (Bar-Yossef et al., 2002), 
and finding heavy hitters (Cormode and Hadjieleftheriou, 2010), 
i.e. frequent items that occur more than a fixed percentage of the 
time. Most of these algorithms are approximate, and probabilistic, 
one can specify the degree of the approximation and the probability 
of correctness beforehand, which will in turn affect the memory 
requirements and running time. The bioinformatics community has 
not widely adopted streaming algorithms. We suspect this is in 
part because of the probabilistic requirement and the complexity of 
implementation. However, we argue that probabilistic algorithms 
are ideally suited for sequencing data because a) the sequencing 
experiment is inherently probabilistic, it is unlikely that the 
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same dataset can be generated twice and b) if done properly 
the probability that the algorithms is correct can be taken as an 
input parameter, furthermore we can show that this probability is 
independent of the dataset. 

Sequencing errors cause considerable problems for fc-mer based 
algorithms for both assembly and read mapping; in assembly these 
may cause the assembly to become disconnected and increases both 
the memory usage and computational time. In read mapping it may 
lead to increased computational overhead and the true location of 
the read not being found. The removal or correction (Meacham 
et a/., 201 1; Schroder et ai, 2009) of erroneous bases and erroneous 
fragments is therefore a common pre-processing step in the analysis 
of DNA sequences, in particular when doing de novo assembly or 
read mapping to a reference assembly. Several programs (Li et al, 
2010; Kelley et al, 2010; Liu et al, 2013) use fc-mers to explicitly 
fix or remove sequencing errors in the dataset and it is recommended 
to use them prior to assembly (Salzberg et ai, 2012). 

A number of software programs have been written for quality 
control of DNA sequencing data, including FastQC (Andrews, 
2010). FastQC has a number of functionalities useful for quality 
control, including giving a distribution of the quality values assigned 
by the sequencer, quality distribution by position, N content and GC 
content, identifying overrepresented fc-mers and sequence length 
distribution. However, the fc-mers identified tend to be short, 6 
basepairs by default, and although they are useful for identifying 
contaminants they are not unique enough in the underlying genome 
to be useful for assessing the sequencing error rate, independent of 
what is given by the DNA sequencers. 

The main contribution of this paper is a streaming algorithm 
for estimating efficiently the number of fc-mers that occur exactly 
once in a data set, taking care of identifying the fc-mer with its 
reverse compliment. The time requirement for this algorithm is 
only a constant factor times the time taken to read the data and 
requires space that is only logarithmic in the size of the dataset. This 
method can be extended to give an estimate of the fc-mer abundance 
histogram. Additionally our algorithm reports frequency moments 
of the fc-mer count, which are aggregate statistics of the histogram. 

For experimental validation we show the results of running 
KmerStream on reads from a single lane of PhiX control sequences. 
We show the distribution accuracy of the estimator compared to the 
accurate fc-mer counts and that the estimators are almost always 
within the approximationi levels guaranteed. Additionally we a 
simple error model, both from the estimates KmerStream produces 
as well as the true fc-mer counts. Both error estimates are then 
compared to the true fc-mer error rate obtained from mapping fc- 
mers to the reference genome of PhiX 174. The results in section 4.1 
show that our simple error model underestimates the true error rate, 
however only by a few percent on average. 

As a simple application of our method we use it to estimate the the 
error rate in a sequencing dataset, conditioned on the per basepair 
quality value given by the instrument manufacturer. Unlike most 
other methods designed for estimating error of sequencing data, the 
method does not require the mapping of the data to a reference 
genome as a preprocessing step. The low computational overhead 
of our method makes it suitable to identify quality issues early on in 
the analysis pipeline. The method can therefore be used as a filtering 
step to determine the quality of a run, quickly determining errors 
before the read mapping or assembly stage. We ran the method on 
a set of 2656 whole genome sequenced individuals. Our results 



suggest that the error probabilities given by lUumina are largely 
accurate. Our results also show that once we have conditioned on 
an error probability given by the instrument manufacturer there 
is considerable variance in the sequencing error, suggesting that 
sequencing error needs to be reassessed on a per sample basis. We 
further construct a diploid genome containing all the called variants 
assigned to haplotypes and compute the number of fc-mers in this 
diploid genome. 



2 METHOD 

We take as input a collection of short reads S. For each read s G 5 we 
generate all substrings of length k in s, denoted as fc-mers. When reads 
contain other characters than ACGT, such as A', we split the read on the 
non-ACGT characters and consider all fc-mers in the split reads. For each 
fc-mer we also consider the reverse complement of that fc-mer, and in general 
make no distinction between the two when counting. 

A common task is to generate the abundance histogram for the fc-mers. We 
define fi to be the number of distinct fc-mers that appear i times in the set of 
reads. Since modern sequencing technologies are rarely strand-specific, we 
process each fc-mer and its reverse complement as if they were equal. The 
histogram is then simply a table of the fi values. 

Generating the exact histogram requires storing a large number of 
fc-mers in memory and can be done with fc-mer-counting tools such 
as Jellyfish (Mar^ais and Kingsford, 2011) or BFCounter (Melsted and 
Pritchard, 2011). Counting all fc-mers in high-throughput sequencing 
datasets requires tens, or even hundreds of gigabytes of memory, whereas 
the method we propose requires less than ten megabytes. Recently a method 
was proposed that generates an approximation of the histogram based on 
sampling fc-mers (Chikhi and Medvedev, 2013), this method however does 
not have a guaranteed error rate associated with it. 

We propose to not consider the exact histogram itself, but to compute 
statistics based on the histogram counts, which can be found using 
less memory and more speed than current methods by using streaming 
algorithms. 

One of the key statistics we are interested in is f\ , the number of singleton 
fc-mers, i.e. fc-mers that appeal' exactly once in the set of reads. Previous 
studies (Melsted and Pritchard, 2011) have shown that when a genome is 
being sequenced at relatively high coverage the majority of singleton fc- 
mers do not come from the genome, and are generated from sequencing 
errors. Also, the number of singleton fc-mers grows with increased coverage, 
whereas the number of fc-mers from the genome will approach a fixed 
number as coverage increases. 

Given the frequencies fi of the histogram, we define the fc-th frequency 
moment, Pj., as 

Fk = Y.'''-h (1) 

i 

In addition to the number of singleton fc-mers f\ , we are interested in three 
moments; Fq, Fi and F2. Fq is the number of distinct fc-mers in the set 
of reads and Fi is simply the number of fc-mers in the reads, counted with 
repetition. The second moment, F2 , puts a higher weight on the number of 
fc-mers that have high abundance values. 

For each of the frequency moments streaming algorithms have been 
developed, that can estimate their value to within a factor of ( 1 it £ ) with high 
probability using only O (£-2 ■ log(Af)) memory, where A'^ is the number 
of fc-mers. It should be noted that estimating the frequency is solved in the 
general setting of counting arbitrary items in a stream, but for the remainder 
of the paper we will focus exclusively on the fc-mer counting case. 

Estimating the second moment, F2, was first solved in the seminal 
work of Alon, Matias and Szegedy (Alon et al, 1996). This paper also 
formalized the framework and popularized the research field. The first 
rigorous estimator for the number of distinct elements, _Fo is from (Bar- 
Yossef et al, 2002), although earlier work from (Flajolet and Nigel Martin, 
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1985) applies as well. The first moment, Fi, is easiest to construct as we can 
maintain a single counter that is incremented once for each k-mer. 

2.1 Estimating Fq and /i 

To compute the /i estimator we use a hashing approach similar to the 
approach of Bar-Yossef et al. (Bar-Yossef etal., 2002). The high level idea of 
the algorithm is to sample the stream at different rates, and afterwards select 
the sampling rate that gives the best result. For each fc-mer a we compute 
the hash value h{a) and find the least significant 1 when the value is written 
in binary, e.g. if h(a) = OIIOIOO2 then the least significant 1 is in the third 
position. For the zero value we define the position to be 64, or the number 
of bits used to represent the value. If the location of this bit is w, then 2^^^ 
divides h{a) and the binary representation of h{a) ends in 1, followed by 
ui — 1 zeroes. Note that half the fc-mers will have ui = 1 a quarter will have 
w = 2, etc. 

The data structure we use is a list of an'ays, T, one for each potential 
value of w, we say that the array T^, is at level w, and we say that all fc-mers 
with value w hash to this level. Each array has a fixed size R that is only 
dependent on the error parameters e, and each element in the array is a 2-bit 
number storing values from 0 to 3. When processing a fc-mer a we look into 
the array . Conditioned on the value of w the w least significant bits are 
no longer random and so we discard the lowest w bits. We then use the result 
of the division, modulo the size of the amy, an index into the aiTay and 
increase the counter there. As we are only interested in /i , in the application 
presented here we limit the counter to two bits and if the counter is at 3, we 
do nothing. An extension of this work would be to allow the counter to count 
more bits, the algorithm could then be used to estimate fi for general i. 

When the number of fc-mers is much larger than R the array in Ti will 
almost certainly be full. i.e. all the values will be at 3. This indicates that 
sampling at a rate of one half is too small, and so we should look at lower 
sampling rates. Algorithm 1 estimates /i by relating it to the probability 
that a counter in an aiTay has value 0 or value 1 respectively. To get a good 
estimate for those probabilities we require that the number of distinct fc-mers 
that hash to the array is roughly of the same size as the array itself. Thus after 
we have processed all the fc-mers we can find the most appropiiate array 
to use. 

If we look at a fixed level w, the probability of a counter being 0 depends 
only on the number of distinct fc'-mers that hashed to this level, A'^^ . This 
probability should be ( 1 — ) , as the probability of each fc-mer hashing 
to this counter is . Note that the multiplicity of the fc-mer does not factor 
in here, since each fc-mer will always hash to the same counter every time 
it appears. If we can estimate this probability accurately, call it po. we can 
solve for Nn, to get an approximation on the number of fc-mers that hash to 
the level w. Note that solving 



for yields 



PO ■ 



(1_ _)iv™ 

^ r' 

_ In(po) 



(2) 



(3) 



stream of hash values we can adaptively sample at different rates and decide 
on the best rate to use after we have seen all the data when we have to report 
the answer. 

Up to this point we have followed the work of Bar-Yossef et al. (Bar- 
Yossef et al, 2002), the estimator here is identical to the one derived in 
Theorem 2 of (Bar-Yossef et al, 2002). We now show how to extend these 
results to obtain an accurate estimate of /i . 

We define a singleton to be a fc-mer that occurs once and let xi be the 
number of singletons that hash to level w. We note that each counter with a 
value of 1 has to be the result of a singleton fc-mer hashing to this counter, 
since for non-singleton fc-mers the counter would have been increased at 
least twice. On the other hand some singleton fc-mers will hash to the same 
counter as some other fc-mer so we must find a way to relate zi to some 
probability we can compute. The probability that a counter contains the value 
1 is then 



Pi 



"r 



(1- 



(5) 



, this is seen by choosing the singleton fc-mer that should hash to the counter 
in xi ways, it hashes to the counter with probability and all the other 
Nnj — 1 fc-mers must not hash to the counter with probability ( 1 — ) ^™ ~ ^ ■ 
Note that from (2) we see that the part (1 - -| ~ Ms equal to = 

PO- Thus we have pi = ^ pzrPO. simplifying and solving for xi 



R 
R-i 
yields 



XI = {R- 1) 



pi 
PO 



(6) 



We estimate pi in the same fashion as po, namely let ti be the number of 
counters in Tm equal to 1 and set pi = . 

We can expect the error in the estimates of the probabilities to be on the 
which will then translate to an error rate of similar magnitude 



order of 



for the /i estimator. 



Algorithm 1 Streaming algorithm for counting fc-mers 
function UPDATE(a) 

z <— h{a) 

w -f- Least significant bit of z 
i ^ [^^^\ (mod R) 
T™[j]^min(r„[j] + 1,3) 



> 0-based index 



function Estimate-Statistics 



w 
Po 
Pi 
fi 
Fo 



argmin™ |l{« 

|{i:T„,M=0}| 
R 

\{i:T^,li]^l}\ 



= 0}| — II [> select level 

> Probability estimates 



^ ' PO 

In(po) 
In(l-Jj) 

return (/i,i^o) 



2™ 
- 2" 



Since the sampling rate is s^r we can use this to approximate the number 



of empty counters in the array Tw ■ The estimator of the number of distinct 
fc-mers is then 



Fo 



(4) 



1^(1 -is) 

Although this would work for any value of w, i.e. all values of w have the 
same expectation, different values of ui have different variance. If we use 
w = 1, then nearly all values in T„ will be non-zero, giving a poor point 
estimate of po . Similarly if ui is too high, so that only a handful of elements 
map to Tni the estimate for po will be poor. In general we want our estimates 
to be bounded away from 0 and 1, to this end we select the level w* where 
the fraction of empty counters is as close to 50% as possible. By thinning the 



We formalize the selection of the optimal value of w in Theorem 1 , the 
proof of this theorem can be found in the supplement for the paper. 

Theorem 1. If fi > ^ then Algorithm I finds fi such that (1 - 
£)f\ < /i < (1 + £)/i with probability at least (1 — S). The algorithm 
memory and 0(1) time per update, where N 
is the number of elements in the stream. 

Unlike for most statistical estimators, the theorem does not derive a 
distribution for the f\ statistic, but rather a bound on the probability that 
the point estimate deviates too far from the true value. In fact the estimator is 
slightly biased, since it uses the ratio of two unbiased estimators, po and pi 
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in Algorithm 1 . However this bias is guaranteed to be smaller than the error 
rate e used to select the parameter _R. 

We note that this method can be extended to obtain estimates of higher 
frequencies, fi for j > 2. As an example for /2, we note that if a 
counter has the value 2, this can only be obtained from two independent 
singletons mapping to the counter or one fc-mer with frequency 2. The first 

happens with probability ~ ij) and the latter with probability 

^ (l — -jj)^ ^, where xi and X2 are the number of fc-mers with 
frequency 1 and 2 that map to a level. Given an estimate for and xi we 
can solve to obtain an estimate for X2 ■ This scheme could be generalized for 
higher values of i to obtain estimates of fi, although obtaining a guarantee 
on the error rate is left as an open problem. 

2.2 Estimation of error rate 

We can use our results to estimate error rates of a genomic sequencing 
experiment. For this we will focus on the fc-mer error rate, i.e. given a k- 
mer from a read what is the probability that is did not originate from the 
DNA sequenced. This rate is higher than the basepair error rate, since we 
require all of the k basepairs to be intact. If the sequencing errors in a read 
were independently distributed, it would be easy to convert the fc-mer eiTor 
rate to a basepair error rate. However sequencing errors are not independent, 
generally the ends of the higher error rates and sequencing errors can come 
in batches. Thus we will focus on the fc-mer error rate in this discussion and 
note that converting it to a basepair error rate assuming independence will 
lead to an underestimate of the true basepair error rate. 

We use a simple generative model where total number of sampled fc-mers 
with repetition, namely Fi, is given. Each fc-mer comes from a random 
position in the genome of size G, and with probability Sk it contains a 
sequencing eiTor and is en'or-free with probability 1 — ej. . To account for 
repeated errors, we further assume that each erroneous fc'-mer is produced 
by sampling a random basepair and changing it to one of the three other 
nucleotides so that it does not match the reference. The number of fc-mers 
sampled at each location follows a Poisson distribution with mean A and 
so the coverage of an error-free fc-mer is Poi(A(l — e^). For a fixed 
eiToneous fc-mer the coverage follows a Poi(^^|^), assuming that there is 
only one possible position in the genome that could produce this error Our 
model has three parameters, G, A and, e^. To obtain estimates for these 
parameters we require three statistics, namely /i , Fi and, Fg . Given the 
Poisson distributions of the coverages and the number of possible fc-mers G 
for the error-free and G ■ 3 ■ fc for the single error fc-mers, we can derive the 
following equations 

Fo = G-3k(^l- e-lfir^ + g ■ (l - £-^(1"'='=)) (7) 

Fi = A ■ G (8) 

/i = G ■ 3fc ■ w + GA{1 - 

3fe 

= GA£fce-Tnr +GA(l-£fc)e-^(i-=''' (9) 
We can isolate A from (8) and G from (7) to obtain 

A^§.|(3fc(l-e-^).(l-e--.))) (,0) 
Similarly we derive by canceling the common factor GA from (9) using (8) 

^ =efee-^ +(l-efc)e-^(i--'') (11) 

Using equations (10) and (1 1) we can solve for A and ek numerically to 
obtain an estimate of the coverage, fc-mer error rate and finally the genome 
size. 

We make a couple of observations about this model. First, the assumption 
that all error fc-mers are solely due to single nucleotide differences causes 
us to slightly underestimate the true error rate. Second, a sequencing error 



in one fc-mer may not be detected as an error because the error fc'-mer also 
occurs as a true fc-mer in the genome. As we have chosen fc = 31 then in 
random DNA the probability that a eiTor fc-mer also occurs in the genome 
is less than 10^'-*. Third, the same error fc-mer can occur in two distinct 
reads by chance given high enough coverage, this is modeled in our fc-mer 
distribution model via the Poisson vaiiables. However the model does not 
take into account fc-mers that have a high repeat count, due to repetitive 
regions or systematic errors in sequencing. According to our computations 
96.6% of the genome of 31-mers in the genome occur exactly once or twice 
on a diploid genome. It has been observed that in practice sequencing reads 
are not randomly sampled from the genome and the true distribution is not a 
Poisson distribution. In particular the coverage of genomic regions is known 
to be dependent on the GC rate of the basepairs (Minoche et ah, 2011). 
Meacham et al. (Meacham et ah, 2011) have shown that the error rate is 
site specific, i.e. that sequencing error rate varies with the location being 
considered. Nonetheless we show in section 4. 1 that this simplified model 
gives a good estimate of the fc-mer error rate using only three statistics from 
the fc-mer distribution, rather than the entire histogram. 

3 COMPUTATIONAL EXPERIMENTS 

3.1 Implementation 

The software is implemented in C-l~l- and can read FASTQ, 
compressed FASTQ and BAM files. The user can select the fc-mer 
size used, the error rate e and additionally it allows for filtering fc- 
mers based on quality scores. For the quality filtering, all bases in 
a read that do not meet the cutoff are discarded, thus only fc-mers 
where all the bases in the fc-mer have good quality values are kept. 

3.2 Setup 

We ran our algorithm on a data set of 2656 whole genome 
sequenced individuals, using Illumina HiSeq sequencers. The 
individuals being sequenced had an average coverage of 15. 9x 
and a minimum coverage of 6x. All of these data are sequenced 
under the same conditions at the same laboratory and have already 
undergone a number of quality control procedures (Styrkarsdottir 
et al, 2013). We would expect these data to have comparable error 
characteristics. 

All reads that were labeled as "qc-failure" or "optical or per 
duplicate" were discarded from our analysis. For the remaining 
reads we considered all fc-mers that only contained basepairs that 
had a q-value above a fixed threshold. Hence, if a read contains a 
basepair with a q-score less than the given threshold only fc-mers 
to the left and to the right of the basepair were considered. Our 
computations were run for q-score thresholds of 0, 13, 20 and 30, 
corresponding to basepair error rates of 1, 0.05, 0.01 and 0.001, 
respectively. Assuming independent error rates these basepair error 
rates should correspond to a lower bound of 100.0%, 79.6%, 26.8%, 
and 3.1% for the fc-mer error rate, using the value fc = 31. As an 
example, when using a q-score threshold of 30 then, according to 
the annotation given by the instrument manufacturer, all basepairs 
considered should have an error rate of less than 0.001 or 0. 1 %. As 
many of the basepairs will have even lower annotated error rates, 
then if the annotation of the manufacturer is correct, the average 
error rate should be even lower. 

3.3 Counting fc-mers in the human genome 

We constructed the diploid genome of a single individual. As input 
we used Human reference genome build 36, genotypes called in that 
individual as determined using the GATK (DePristo et al., 2011; 
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Table 1. Software comparison for running time and memory usage. 



Software 


Parameter (k) 


Dataset 


Time 


Memory 


KmerStream 


21 


H. Sapiens, chrl4 


194s 


IIM 


KmerGenie 


21 


H. Sapiens, chrl4 


580s 


127M 


KmerStream 


31 


H. Sapiens, chrl4 


220s 


IIM 


KmerGenie 


31 


H. Sapiens, chrl4 


703s 


126M 


KmerStream 


31 


B. Impatiens 


3120s 


IIM 


KmerGenie 


31 


B. Impatiens 


7620s 


114M 



McKenna et al, 2010) genotype caller and the assignment of these 
genotypes to haplotypes using long range phasing (Kong et al, 
2008). From these we constructed two copies of each chromosome. 
We estimate a total of 2.552G 31-mers in this diploid genome, we 
observe that 4.9% of the 31-mers only occur once in the diploid 
genome, indicating that they overlap a polymorphic region, 91.7% 
of the 31-mers occur twice, indicating that the region is non- 
polymorphic, 0.1% occur three times, 1.9% of the 31-mers occur 
four times and 1.4% occur more than four times. 



Fq estimate vs. true 



fi estimate vs. true 



I — I — \ — I — I — I 



I — I — I — \ — I 



0.97 0.99 1.01 0.96 1.00 1.04 

Estimated vs. true error rates Estimated vs. true error rates 



I — I — I — \ — I — I — I — I I — I — I — \ — I 

0.88 0.92 0.96 1.00 0.92 0.96 1.00 



4 RESULTS 

For running time and memory, we compare our software, 
KmerStream, to using KmerGenie on the H. Sapiens chr 14 
(36M reads) and B. Impatiens (303 M reads), datasets from 
GAGE (Salzberg et al, 2012). All tests were run on a Intel Xeon 
E-2650 16-core processors at 2.4 GHz with 128 Gb of memory. For 
the time comparison we only ran the histogram estimation step, the 
program specialk, and not the entire pipeline for KmerGenie. 

It should be noted that the time to read the H. Sapiens dataset 
was 60 seconds and 1800 seconds for the compressed B. Impatiens 
dataset. Our experiments show that KmerStream is at least 2-3 times 
as fast as the previous approach and the memory usage is an order 
of magnitude better. 

4.1 PhiX174 Validation 

To validate the accuracy of the fc-mer error rate model proposed in 
section 2.2 we used sequencing reads obtained from a PhiX control 
lane. The sequencing data from the control has 363M reads, while 
the genome size of PhiXI74 is only 5386 nucleotides. Due to the 
small genome size, reads were sampled so that the per basepair 
coverage would be 30-fold. This was necessary to scale down 
the coverage to values the better correspond to normal sequencing 
coverage used for whole genome sequencing. This was repeated 
1000 times and each computational experiment run independently. 
For each input we used fc = 31 and classified all fc-mers in 
reads as true or false, depending on whether they appeared in the 
reference sequence or not. Separate histograms were generated for 
both classes of fc-mers. We then ran KmerStream for each input 
obtaining estimates for /i , F\ and Fq . The top row of figure 1 shows 
the distribution of relative accuracy for /i and Fq respectively. The 
program was run with a default error guarantee of 2% for Fq, since 
for this input /i /Fo is about 0.5 this results in a 4% error guarantee 
of /i . From the distribution of figure 1 (top row), we see that the 



Fig. 1. Relative accuracy of KmerStream. Top row: accuracy for _Fo and 
/i respectively. Bottom row: relative accuracy of fc-mer eiTor rate based on 
fc-mer estimates (left) and true fe-mer numbers (right). 

vast majority of the experiments have estimates within the error 
guaranteed. 

The error rate model was fitted using both the true values for the 
statistics as well as the estimated values from KmerStream. The true 
fc-mer error rate was obtained by matching fc-mers to the PhiX 174 
reference genome. 

Given the number of caveats listed when deriving our simplified 
model, as well as obtaining estimates from only three key statistics, 
we are pleasantly surprised to see how well the model fits the actual 
results. Based on the true values for the fc-mer statistics we see 
from figure 1 (bottom row, right) that our model underestimates 
the fc-mer error rate by 4% on average. When using the estimated 
values from KmerStream, figure 1 (bottom row, left) shows that we 
underestimate the error rate by 4% on average, same as using the 
true fc-mer statistics, but we observe an increase in the variance of 
the accuracy of our estimate. This increase in variance is primarily 
because of the underlying variance in our estimate of the fc-mer 
statistics. Of course the KmerStream method is much more efficient 
than obtaining the accurate counts, both in terms of runtime and 
memory. 

4.2 Error rates 

To estimate the error rates we ran a set of experiments on a set 
of 2656 whole genome sequenced individuals. Figure 2 shows a 
histogram of the average per fc-mer error rate for different q- value 
thresholds and in Table 2 we give critical values of the distribution of 
the read error rate. We observe that without any filtering on q- values, 
on average 5.9% of the fc-mers are estimated to contain errors. This 
number varies considerably between samples and 5% of the samples 
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Table 2. Percentiles of fc-mer eiTor 
rates at different q-value tliresholds for 
2656 individuals. 



Percentile 



Threshold 


5 


50 


95 


qO 


3.7% 


5.9% 


11.4% 


ql3 


0.6% 


1.1% 


2.2%, 


q20 


0.4% 


0.8% 


1.8% 


q30 


0.3% 


0.8%) 


2.2% 



have an estimated error rate of 11.5% or higher, while 5% of the 
samples have an estimated error rate of 3.7% or lower. 

We observe that the error rate decreases with a higher q-value 
threshold. We observe a marked uniform decline in the fraction 
of error fc-mers when the threshold is increased from 0 to 13, an 
average reduction of 80%.. A smaller, but clear, decrease is observed 
when the q-value threshold is increased from 13 to 20, 25% on 
average. However when comparing increasing the threshold from 
20 to 30 we observed an average change of —0.9%, but a median 
of 9%. Additionally this decrease in the number of error fc-mers is 
not uniform and in 44% of all individuals the estimated error rate 
increases when the q-value threshold is increased from 20 to 30, 
i.e. using q-value thresholds larger than 20 does not necessarily give 
one lower error rate fc-mers. Furthermore the number of fc-mers, 
i.e. the coverage of the data set was reduced by 25% on average by 
increasing the q-value threshold from 20 to 30. 

In Figure 2 we consider only the fc-mers that are removed when 
the error threshold is increased, giving a histogram of the fraction 
of those that are error fe-mers. We observe that when increasing 
the q-value threshold from 0 to 13, on average an estimated 35% 
of the fc-mers removed are error fc-mers. This suggests that unless 
the algorithm being used for analysis is highly robust to errors 
then using a q-value threshold lower than 13 will in many cases 
be problematic. On average of 85% of all fc-mers in our dataset have 
q-value at all bases greater than 13, indicating that not considering 
fc-mers with a q-value threshold less than 13 has limited impact on 
the number of fc-mers being considered while it has a large impact 
on the fraction of fc-mers that are error free. When we increase the 
q-value threshold from 13 to 20 an average of 5% of the fc-mers 
removed are error fc-mers and when increasing the threshold from 
20 to 30 an average of 1 .5% of the fc-mers removed are error fc-mers. 



5 DISCUSSION 

The amount of data being gathered with modem sequencing 
methods continues to grow at a faster rate than our ability to 
analyze and store the data. An alternative view to the current state 
of the art is to consider technologies that sequence DNA "on the 
fly". In this case the sequencing machine does not store all the 
results, but rather transmits the sequence reads as they are generated. 
Technologies that fit this framework have been proposed (Branton 
et al, 2008; Clarke et al, 2009) and are currently in development, 
such as technologies from Oxford Nanopore, but technical details 
are limited at this point in time. Regardless of the exact technologies 



qO q13 




I — I — I — I — I — I — I — I — I I 1 1 1 1 

0.00 0.12 0.24 0.00 0.04 0.08 

q20 q30 




I 1 1 1 1 I — I — I — I — I — I — I 

0.00 0.04 0.08 0.00 0.04 0.08 0.12 



Fig. 2. fc-mer error rates for different quality filters for 2656 individuals. 



used, this new sequencing paradigm opens up new opportunities 
for online or streaming analysis of the data, where we bypass the 
storage requirements, and simply plug the sequencing directly into 
the analysis. The upshot of developing streaming algorithms for 
analysis of genetics datasets is that they are not only efficient in 
terms of computational resources, but also future-proof in terms of 
new sequencing paradigms. 

One benefit of the algorithms we have developed is as follows; 
that Fo — /i is a crude estimate of the number of fc-mers that 
have been sequenced at least twice, when this number goes above 
a certain fraction of the genome size we can decide to stop 
sequencing. Another benefit is that when the error rate goes above 
some threshold we can decide to stop the experiment immediately, 
not wasting our time on failed experiments. The method presented 
here can be particularly useful when used for a species that has 
not been previously sequenced, allowing us to get an estimate the 
coverage of this genome while sequencing prior to assembly. 

When we condition on the error rate given by Illumina we 
see considerable variability in the error rate between individuals. 
Hence, it is not advisable to use the error rates in a model without 
considering differences between individuals. 

Our results show that although the base pair quality values 
given by the instrument manufacturer are largely correct, there 
appears to be a considerable sample dependent difference in the 
the error rate conditioned on the base pair quality rate reported 
by the manufacturer. Our recommendation based on the results of 
sequencing 2656 individuals is to estimate both the number of fc- 
mers Fo as well as the coverage and fc-mer error rate for multiple 
q-value thresholds and decide on a case by case basis. 
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